Abstract

Background: Recent studies have shown that human populations have experienced a complex demographic 
history, including a recent epoch of rapid population growth that led to an excess in the proportion of rare 
genetic variants in humans today. This excess can impact the burden of private mutations for each individual, 
defined here as the proportion of heterozygous variants in each newly sequenced individual that are novel 
compared to another large sample of sequenced individuals. 

Results: We calculated the burden of private mutations predicted by different demographic models, and 
compared with empirical estimates based on data from the NHLBI Exome Sequencing Project and data from the 
Neutral Regions (NR) dataset. We observed a significant excess in the proportion of private mutations in the 
empirical data compared with models of demographic history without a recent epoch of population growth. 
Incorporating recent growth into the model provides a much improved fit to empirical observations. This
phenomenon becomes more marked for larger sample sizes. For example, extrapolating to a scenario in which 10,000
individuals from the same population have been sequenced with perfect accuracy, about 1 in 400
heterozygous sites (or about 6,000 variants) in the 10,001st individual are still predicted to be novel, 18 times the
number predicted in the absence of recent population growth. The proportion of private mutations is additionally
increased by purifying selection, which differentially affects mutations of different functional annotations.

Conclusions: The burden of private mutations for each individual, which are singletons (i.e. appearing in a single 
copy) in a larger sample that includes this individual, is predicted to be greatly increased by recent population 
growth, as well as by purifying selection. Comparison with empirical data supports the conclusion that European
populations have experienced recent rapid population growth, consistent with previous studies. These results have
important implications for the design and analysis of sequencing-based association studies of complex human disease
as they pertain to private and very rare variants. They also imply that personalized genomics will indeed have to be
very personal in accounting for the large number of private mutations.




Genomics 



Background 

Many recent studies that sequenced large numbers of indi- 
viduals have shown that human populations have experi- 
enced a complex demographic history, including a recent 
epoch of rapid growth in effective population size, 
although estimates have varied greatly among studies 
[1-7]. The growth of the European population has recently
been estimated to be exponential, with a 2-5% increase in
population size per generation [1,3,7]. This recent
growth has resulted in an excess of rare single nucleotide 
variants (SNVs), commonly defined as those with a minor 
allele (the less common of the two alleles) frequency 
(MAF) of less than 0.5% (or 1%) in a sample of individuals 
from the same population [e.g. [5,8]]. The proportion of 
singletons (SNVs with only one copy in the entire sample) 
is especially elevated due to this recent rapid growth 
[1,3,5,7,9]. Consequently, the corresponding site frequency
spectrum (SFS), a summary statistic that indicates the
proportion of variants of each possible allele count in the
sample (e.g. Figure 1), is skewed towards lower allele counts.
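As an illustration of the summary statistic just described, the following minimal sketch tabulates a folded SFS from a list of minor allele counts. The function name `folded_sfs` and the toy counts are hypothetical and are not taken from either dataset.

```python
from collections import Counter


def folded_sfs(minor_allele_counts, n_chromosomes):
    """Proportion of variants at each minor allele count 1..n_chromosomes // 2."""
    max_count = n_chromosomes // 2
    counts = Counter(minor_allele_counts)
    # Only polymorphic sites with a valid minor allele count contribute.
    total = sum(c for k, c in counts.items() if 1 <= k <= max_count)
    return {k: counts.get(k, 0) / total for k in range(1, max_count + 1)}


# Toy data (hypothetical): minor allele counts of 8 SNVs in 10 chromosomes.
toy_counts = [1, 1, 1, 1, 2, 2, 3, 5]
sfs = folded_sfs(toy_counts, 10)
print(sfs[1])  # proportion of singletons in the toy sample
```

A larger value of `sfs[1]` relative to the other entries corresponds to the skew towards low allele counts discussed in the text.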

A predicted consequence of the skew in the SFS due to 
population growth is an increase in the burden of private 
mutations for each individual. We recently defined this 
quantity as the proportion of heterozygous positions in 
each newly sequenced individual that are novel, i.e., com- 
pletely absent from a previously sequenced sample from 
the same population [9]. In that previous paper, we 
observed this burden to be higher in samples from popu- 
lations of European and East Asian descent than is pre- 
dicted by previously estimated demographic models that 
do not include an epoch of recent population growth [9]. 
However, empirical estimates in that paper were based 
on a small sample size of less than 100 individuals, while 
the contribution of recent rapid growth is expected to be 
more pronounced for larger sample sizes [1-6,9]. 

Here, we set out to (1) empirically estimate the burden 
of private mutations from large samples of individuals of 
European ancestry, (2) compare these estimates with pre- 
dictions of previously proposed demographic models 
with and without a recent epoch of exponential growth 
[3,10], and (3) contrast SNVs of different functions that 
are expected to have undergone different selective effects. 
As purifying, negative selection on deleterious SNVs 
skews the SFS towards rare variants [1,5,11-13], it can 
interact with the effect of recent population growth in 
increasing the burden of private SNVs, and differently so 
for different functional categories. With the rapidly 



decreasing cost of sequencing, more and more high-qual- 
ity sequencing data sets of large sample sizes and 
improved accuracy of detecting rare variants become 
available. This provides an excellent opportunity for a 
more accurate study of the burden of private mutations. 
In this paper, we considered two such sequencing data 
sets of samples from populations of European ancestry: 
the NHLBI Exome Sequencing Project (ESP) [1,7] and
the Neutral Regions (NR) data set of putatively neutral 
regions [3]. 

Results and discussion 

In all analyses, we contrast three different demographic 
models and the fit of their predictions to the NR data set 
[3] and to 7 functional categories of the ESP data set 
[1,7]. The three demographic models are (1) a population 
that has been of constant population size throughout his- 
tory, (2) a model of European history that includes two 
population bottlenecks [10], and (3) a model of European 
history with two bottlenecks, a recent change in popula- 
tion size, followed by a recent epoch of rapid population 
growth [3] (Model II therein). 

Comparison of site frequency spectra 

As the burden of private mutations is a function of the 
site frequency spectrum, we first contrasted the site fre- 
quency spectra between three demographic models, the 
NR data [3], and the ESP data [1,7] (Figure 1). In order to
allow comparison of the data sets with different sample
sizes, as well as account for missing genotype calls for
each SNV, we probabilistically subsampled all data to a
sample size of 900 haploid chromosomes (Methods).

Figure 1 Site frequency spectra of demographic models and data with a sample size of 900. The SFS for 3 demographic models, the
Neutral Regions (NR) data and 7 categories of the Exome Sequencing Project (ESP) data. To adjust for the different sample sizes in the two datasets,
probabilistic subsampling was applied to make all sample sizes equal to 900 chromosomes. Only the first 10 minor allele count categories are shown.
For each minor allele count, from left to right: constant population size, European history with 2 bottlenecks but no growth [10], European history with
recent growth (Model II in [3]), the NR data, intergenic SNVs of the ESP data, intron SNVs of the ESP data, synonymous SNVs of the ESP data, UTR SNVs
of the ESP data, missense SNVs of the ESP data, nonsense SNVs of the ESP data and splice SNVs of the ESP data.

The proportion of singletons predicted by demographic
models (1) and (2) is much lower than that in the observed
data and that predicted by model (3), where recent
growth is incorporated (Figure 1). Among the categories
of the ESP data, categories that are expected to be more
functional show a higher proportion of singletons, e.g.
intronic, intergenic, synonymous, and UTR SNVs have a
significantly lower proportion than non-synonymous,
nonsense, and splice SNVs (Figure 1), which is expected
because the latter are more often deleterious. These results
recapitulate those from the ESP [7]. The proportion of
singletons in the SNVs from the NR data is lower than 
all categories of SNVs from ESP, which is consistent 
with the former being designed such that variants are 
very far from genes and putatively neutral [3], while the 
latter consists of variants in and near protein-coding 
genes [1,7], which are expected to more often be tar- 
geted by purifying selection. Another factor that can 
contribute to this difference between the NR and ESP 
datasets is that the former aimed to capture a sample of 



homogenous ancestry, which corresponds to North- 
Western European ancestry [3], while the latter consists 
of a broad sample of European Americans that exhibits 
a higher level of population structure [1,7]. Increased 
population structure can lead to an increase in the pro- 
portion of rare variants since some of these can be due 
to mutations that postdate the split of the population 
captured by the different ancestries [3]. 

Comparison of the burden of private mutations 

The predicted burden of private mutations for each indivi- 
dual from all demographic models and the empirical bur- 
den observed in the different data sets and functional 
categories are presented in Figure 2. Across all sample 
sizes, the burden of private mutations from empirical data 
is significantly higher than that predicted by demographic 
models without growth. For example, based on the results
of the NR data, when 100 individuals have been
sequenced, we estimated that about 1.4% of all heterozygous
sites in the 101st sequenced individual are novel,
that is, specific to the 101st individual and completely
absent from the first set of 100 individuals. While models
(1) and (2) predict only about 1% in this scenario, model (3) is
consistent with this estimate in the NR data.

Figure 2 The burden of private mutations of demographic models and empirical data. The burden of private mutations for the same
demographic models and empirical data as in Figure 1, using the same colors. This quantity corresponds to the percentage out of all heterozygous
sites in a newly sequenced genome that are novel after n genomes have already been sequenced. Results are presented for n = 100, n = 492,
n = 1000, n = 4299 and n = 10000. The values of 492 and 4299 are dictated by the sample sizes of the NR and ESP datasets, respectively. For empirical
data, mean percentage across individuals is presented, together with error bars that denote ± one standard error across SNVs, estimated via
bootstrapping (Methods). Double-slashes around a value of 0 on the x-axis represent instances where data for that sample size is not available in the
respective dataset. Note that the range above 5% on the y-axis is rescaled. The corresponding values in this figure are shown in Table 1.

For all demographic models and observed data, as more 
individuals are sequenced, the burden of private muta- 
tions decreases (Figure 2), because increasing sample size 
makes it more probable that a variant has already been 
discovered [9]. At the same time, the effect of recent 
growth itself on the burden of private mutations is much 
more pronounced with increasing sample size. For example,
for the NR data, when 492 individuals are sequenced,
the estimated burden of private mutations in the 493rd
sequenced individual is about 0.76% (Table 1). The estimates
from models (1) and (2) are only 0.20% and
0.26%, respectively, about a third of the empirical estimate, while
model (3) matches the data well (0.75%). We note that this
percentage varies greatly across individuals given the relatively
small number of SNVs in the NR data (Table 2).

When extrapolating the models to a scenario in
which 10,000 individuals are sequenced, model (3) predicts
the burden of private mutations of the 10,001st individual to be
0.24% (Table 1), 24 times and 18 times the 0.010% and 0.013%
predicted by models (1) and (2), respectively, which do not
include recent growth (Table 1). This
corresponds to almost 1 in 400 heterozygous positions,
which is equivalent to about 6,000 variants genome-wide.
This estimate is at least two orders of magnitude larger
than the expected number of de novo mutations of each
individual [e.g. [14]]. Hence, we predict that thousands of
novel variants will be discovered in each newly sequenced 
genome even after tens of thousands of genomes from 



exactly the same population have already been sequenced 
with perfect accuracy, and that these are rarely due to de 
novo mutations. 
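The figures quoted in this paragraph can be cross-checked directly against the n = 10000 column of Table 1. The short sketch below does this arithmetic; note that the number of heterozygous sites per genome is not stated in the text, so the last quantity is only the value implied by the quoted 6,000 variants, not a reported one.

```python
# Burden of private mutations at n = 10000, as proportions (Table 1).
burden_growth = 0.237 / 100      # model (3), with recent growth
burden_constant = 0.010 / 100    # model (1), constant size
burden_bottleneck = 0.013 / 100  # model (2), two bottlenecks

fold_vs_constant = burden_growth / burden_constant      # roughly 24
fold_vs_bottleneck = burden_growth / burden_bottleneck  # roughly 18
one_in = 1 / burden_growth                              # roughly 1 in 420

# Assumption: back-calculated from the quoted "about 6,000 variants";
# the text does not report this count directly.
implied_het_sites = 6000 / burden_growth                # roughly 2.5 million

print(round(fold_vs_constant), round(fold_vs_bottleneck), round(one_in))
```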

Another important observation is that the burden of 
private mutations for each individual calculated from all 
seven categories of the ESP data is consistently higher 
than that from the NR data for all sample sizes (Figure 2). 
This is consistent with the observation that the SFS of the
ESP data are more skewed towards low allele counts than those of the NR data,
which in turn is consistent with a weaker effect of purifying
selection and population structure on the latter. Comparing
the different ESP categories, splice and nonsense
SNVs, which are expected to most often be deleterious,
have the largest burden of private mutations across all
sample sizes. Similarly, the burden of the other functional
categories is ordered by common expectations as to how often
such mutations are functional. The burden
of private mutations captures a unique summary of the
SFS that more clearly shows the effect of purifying selection.
For example, when n = 492, the proportion of singletons
is 46.2% for the ESP intergenic SNVs and 74.8% for
the ESP splice SNVs, a 1.6-fold difference. In comparison, the
burden of private mutations for splice SNVs is about
9.7-fold that for intergenic SNVs. This difference is
even more pronounced when the sample size is larger,
reaching 12.7-fold when n = 4299 (Figure 2).

Table 1 Estimated mean and standard error of percentage of private mutations for each individual.

Group                                   n = 100           n = 492           n = 1000          n = 4299          n = 10000
Constant Population Size Model          0.995%            0.203%            0.100%            0.023%            0.010%
European History with Two Bottlenecks   1.092%            0.257%            0.129%            0.031%            0.013%
European History with Recent Growth     1.406%            0.750%            0.596%            0.349%            0.237%
NR Data                                 1.444% (0.106%)   0.756% (0.043%)   NA                NA                NA
ESP Intergenic                          2.132% (0.123%)   1.125% (0.049%)   0.835% (0.034%)   0.496% (0.019%)   NA
ESP Intron                              2.233% (0.022%)   1.171% (0.009%)   0.922% (0.007%)   0.528% (0.004%)   NA
ESP Synonymous                          2.366% (0.026%)   1.252% (0.012%)   0.974% (0.009%)   0.573% (0.005%)   NA
ESP UTR                                 2.492% (0.079%)   1.305% (0.034%)   1.004% (0.025%)   0.596% (0.014%)   NA
ESP Missense                            4.482% (0.049%)   2.632% (0.026%)   2.121% (0.019%)   1.333% (0.011%)   NA
ESP Nonsense                            10.04% (0.92%)    7.37% (0.68%)     6.00% (0.50%)     4.46% (0.38%)     NA
ESP Splice                              14.36% (2.29%)    10.91% (1.50%)    8.41% (1.19%)     6.31% (0.91%)     NA

The burden of private mutations for n = 100, n = 492, n = 1000, n = 4299 and n = 10000, the corresponding values for Figure 2 and shown here for
completeness. The number in parenthesis denotes the standard error across SNVs estimated via bootstrap (Methods). NA indicates that the data for that sample
size is not available in the respective dataset.

Table 2 The mean and standard deviation of the burden
of private mutations across individuals.

Group                                   The Burden of Private Mutations
Constant Population Size Model          0.208% (0.299%)
European History with Two Bottlenecks   0.276% (0.352%)
European History with Recent Growth     0.736% (0.614%)
NR Data                                 0.758% (0.852%)

The burden of private mutations and the standard deviation of the sample for
three demographic models and the NR data. The results correspond to n =
492, the sample size of the NR data less one, as they are based on the
individuals from that dataset. These results are not based on randomized
chromosomes, but rather on the actual genotype information for each
individual in turn. For the three demographic models, sequences were
simulated with the same number of SNVs as in the NR data (Methods). The
number in parenthesis denotes the standard deviation of the sample. These
large standard deviations suggest a significant variation in percentage of
private mutations across individuals when the small number of SNVs from the
NR dataset is considered.

Conclusions

Recent whole-genome sequencing data sets show that the
proportion of rare variants in large samples, especially
singletons, is significantly elevated compared with the
prediction from the standard coalescent theory that 
assumes a constant population size and from previous 
demographic models without recent growth [1,3,7,9]. 
Recent demographic modeling studies predict that 
humans have experienced a recent and rapid population 
growth, which explains an increased proportion of single- 
tons and other rare variants [1-6]. In this paper, we 
examined the burden of private mutations for each indi- 
vidual, a statistic that reflects the relationship between 
the relative proportions of singletons and more common 
variants contained in a sample, with three demographic 
models and two data sets under different sample sizes. 
We found that the burden of private mutations calcu- 
lated from empirical data and estimated from demo- 
graphic models with a recent growth is significantly 
higher than that estimated from models without recent 
growth across all sample sizes. The discrepancy is predicted
to be much more pronounced for larger numbers
of sequenced individuals. We showed that this finding is
consistent with a recent epoch of population growth.
Moreover, we found that the SNVs that are affected by
stronger purifying selection will generally have a larger
burden of private mutations compared with more selectively
neutral SNVs, since they will have a higher proportion
of singletons.

The proportion of private mutations that we consider 
translates to the number of novel variants expected to 
be ascertained with each newly sequenced genome. 
Hence, our results have implications for sequencing-based
association studies of complex human diseases
and other sequencing studies. For instance, we predict
that even after 10,000 individuals from the exact same
European population have been perfectly sequenced,
1 in 400 heterozygous sites will still be novel in each newly



sequenced genome, which corresponds to discovering 
about 6,000 new variants. This large expectation is due to 
the effect of the recent rapid growth of European popula- 
tions, which leads to this number being at least 18-fold 
that predicted in the absence of such growth. Hence, 
careful consideration must be given to private mutations 
in the design and analysis of sequencing-based associa- 
tion studies and in quantifying the role played by rare 
variants in complex human disease [15-19]. 

Methods 

Datasets 

Two data sets were used in this study. The NR data
contain the genotypes of 493 European individuals of
highly homogeneous ancestry at relatively neutral SNVs in 15
genetic regions [3]. For quality purposes, all SNVs with
fewer than 900 successful genotype counts were filtered
from the analysis. The remaining 1,746 SNVs constitute
95% of all variants [3]. The summarized data of 4,300
European individuals from the NHLBI Exome Sequencing
Project record the minor allele count and major allele
count of each SNV identified in 15,585 genes on all
chromosomes (including chromosomes X and Y) [1,7]. In
this analysis, we combined all of the autosomal SNVs
according to the 7 categories: intergenic, intron, missense,
nonsense, splice, synonymous and UTR. For quality
purposes, SNVs were filtered out if the average read depth
was less than or equal to 20 or the successful genotype
count was less than 8,170 (95% of the 8,600 chromosomes).

Subsampling approach 

In order to compare the SFS of data with different sam- 
ple sizes (including the different sample sizes across the 
SNVs caused by unsuccessful genotype counts in the 
same data set), all the observed data were subsampled 
to 900 chromosomes. Following the strategy used in
[10], for a SNV with i minor alleles out of n successful
genotype counts, the probability that it has x minor
alleles when subsampled to m chromosomes is

$$P(x \text{ of } m) = \frac{1}{1 + \delta(x, m - x)} \left[ \frac{\binom{i}{x}\binom{n-i}{m-x}}{\binom{n}{m}} + \frac{\binom{i}{m-x}\binom{n-i}{x}}{\binom{n}{m}} \right]$$

where $\delta(a, b) = 1$ if $a = b$ and $\delta(a, b) = 0$ if $a \neq b$,
$x = 0, 1, 2, \ldots, \lfloor m/2 \rfloor$ and $\binom{a}{b} := 0$ if $a < b$.
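The subsampling probability can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the folded hypergeometric form described above is the one intended; the function name `subsample_prob` is hypothetical and this is not the code used in the study.

```python
from math import comb


def subsample_prob(i, n, x, m):
    """Probability that a SNV with i minor alleles among n chromosomes
    shows x minor alleles after subsampling m chromosomes (folded)."""
    if not 0 <= x <= m // 2:
        return 0.0
    # math.comb(a, b) already returns 0 when b > a, matching the convention above.
    direct = comb(i, x) * comb(n - i, m - x)   # x copies of the original minor allele drawn
    flipped = comb(i, m - x) * comb(n - i, x)  # original major allele becomes the minor one
    delta = 1 if x == m - x else 0             # avoid double counting when x = m / 2
    return (direct + flipped) / (comb(n, m) * (1 + delta))


# A singleton among 986 chromosomes, subsampled to 900 chromosomes.
p_still_singleton = subsample_prob(1, 986, 1, 900)
print(p_still_singleton)
```

For a singleton the second term vanishes, so the probability of remaining a singleton reduces to m/n, here 900/986.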

Expected SFS and the burden of private mutations for 
demographic models 

The SFS of the three demographic models were calculated 
using exact computation [20] instead of simulations. 

For a demographic model with constant population
size, the burden of private mutations can be derived
under standard coalescent theory [21]. For constant
population size, the expected number of singletons of a
folded SFS for a sample of (n + 1) diploid individuals is

$$E[\eta_1] = \theta \left( 1 + \frac{1}{2n + 1} \right)$$

where $\theta = 4N\mu$. The expected number of singletons
that belong to one individual is

$$E[s] = \frac{1}{n + 1} E[\eta_1] = \frac{2\theta}{2n + 1}$$

The expected number of heterozygous sites for the
pair of sequences from one individual is $E[h] = \theta$. Thus
the expected burden of private mutations is

$$E[a] = \frac{E[s]}{E[h]} = \frac{2}{2n + 1}$$

For variable population size, the general solution is

$$E[a] = \frac{1}{n + 1} \cdot \frac{E[T_{2n+2,1}] + E[T_{2n+2,2n+1}]}{E[T_{2,1}]}$$

where $T_{p,q}$ stands for the total length of all branches
in the coalescent tree which have exactly q descendants out
of the total number of descendants p. The branch lengths
are calculated by exact computation [20].
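As a check on the constant-size result, the sketch below evaluates the closed form 2/(2n + 1) at the sample sizes of Table 1 and compares it with the general branch-length expression. The function names are hypothetical, and the branch lengths used for the general expression are the standard constant-size coalescent expectations (proportional to 1/q), supplied here as an assumption rather than the exact computation of [20].

```python
def burden_constant(n):
    """Expected burden of private mutations under constant population size."""
    return 2 / (2 * n + 1)


def burden_general(n, branch_length):
    """General expression; branch_length(p, q) returns E[T_{p,q}]."""
    p = 2 * n + 2
    numerator = branch_length(p, 1) + branch_length(p, p - 1)
    return numerator / ((n + 1) * branch_length(2, 1))


def constant_size_branch_length(p, q):
    # Assumption: standard coalescent expectation, in coalescent time units.
    return 2 / q


for n in (100, 492, 1000, 4299, 10000):
    closed = burden_constant(n)
    general = burden_general(n, constant_size_branch_length)
    print(n, f"{100 * closed:.3f}%", f"{100 * general:.3f}%")
```

The printed percentages agree with the Constant Population Size Model row of Table 1.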

Computation of the burden of private mutations using 
data sets and simulations 

For the NR data, for each of the 493 individuals, the
burden of private mutations a is directly calculated as
the proportion of heterozygous sites that are singletons
in the sample, using the individual genotypes. Missing
genotypes were excluded. The mean and standard deviation
of a for this sample were then calculated by

$$\bar{a} = \frac{1}{n} \sum_{i=1}^{n} a_i, \qquad s(a) = \sqrt{\frac{\sum_{i=1}^{n} (a_i - \bar{a})^2}{n - 1}}$$

where n is the sample size and equals 493 here.
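The per-individual calculation can be sketched as follows. The genotype encoding (number of minor alleles per site, with `None` for a missing call), the function name `burden_per_individual`, and the toy matrix are all assumptions made for illustration; individuals with no heterozygous sites are simply skipped here, which the text does not specify.

```python
from statistics import mean, stdev


def burden_per_individual(genotypes):
    """genotypes[k][j] is the minor allele count (0, 1, 2) of individual k
    at SNV j, or None when the genotype call is missing."""
    n_snvs = len(genotypes[0])
    # Total minor allele count per SNV, ignoring missing genotypes.
    totals = [sum(g[j] for g in genotypes if g[j] is not None) for j in range(n_snvs)]
    burdens = []
    for g in genotypes:
        het_sites = [j for j in range(n_snvs) if g[j] == 1]
        if not het_sites:
            continue  # assumption: undefined burden, individual skipped
        private = sum(1 for j in het_sites if totals[j] == 1)
        burdens.append(private / len(het_sites))
    return burdens


# Toy data (hypothetical): 3 individuals, 4 SNVs.
toy = [
    [1, 1, 0, 0],
    [0, 1, 1, None],
    [0, 0, 0, 1],
]
burdens = burden_per_individual(toy)
print(burdens, mean(burdens), stdev(burdens))
```

`statistics.stdev` uses the n - 1 denominator, matching the sample standard deviation given above.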

For ESP data and demographic models, as the indivi- 
dual genotypes were not available, sequences were simu- 
lated by distributing the minor alleles of each SNV to 
individuals randomly and independently. Unsuccessful 
genotype calls (missing genotypes) were also distributed 
randomly to the individuals but were distributed in 
pairs. In other words, the genotypes of each individual 
at each site either were both existent or both missing. 
Then a was calculated using these simulated sequences 
in the same way as for the NR data. 

For the demographic histories from which we can only 
get the SFS, a similar method is applied. Namely we 
simulated a certain number of SNVs according to the 



SFS and randomly assigned the minor alleles into indivi- 
dual sequences. The simulated sequences were paired 
randomly to form the sequences of an individual and a 
for each individual was then calculated. 
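The random assignment of minor alleles to individuals might look like the sketch below. It is an assumption-laden illustration: the function name `simulate_genotypes` is hypothetical, missing genotypes are not simulated, and consecutive chromosomes are paired into individuals, which is equivalent to random pairing here because alleles are placed at random and independently at each SNV.

```python
import random


def simulate_genotypes(minor_counts, n_individuals, rng):
    """Distribute the minor alleles of each SNV at random over
    2 * n_individuals chromosomes and return per-individual genotypes."""
    n_chrom = 2 * n_individuals
    genotypes = [[0] * len(minor_counts) for _ in range(n_individuals)]
    for j, k in enumerate(minor_counts):
        # Choose k distinct chromosomes to carry the minor allele at SNV j.
        for chrom in rng.sample(range(n_chrom), k):
            genotypes[chrom // 2][j] += 1  # chromosomes 2i and 2i + 1 form individual i
    return genotypes


# Toy input (hypothetical): three SNVs with minor allele counts 1, 2 and 5.
sim = simulate_genotypes([1, 2, 5], 4, random.Random(0))
print(sim)
```

The resulting matrix uses the same encoding as the per-individual calculation for the NR data, so the burden can be computed from it in the same way.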

To calculate a for a smaller sample size m, m individuals 
were randomly chosen from the original n individuals 
and a was calculated using the genotypes from these 
m individuals with the previously stated approach. 

To study the effects of the limited number of sites, a bootstrap
approach was applied. Specifically, we resampled individual
SNVs with replacement 1,000 times. For each
bootstrap replicate, we calculated the average a ($a_{b,i}$) across all
individuals, and these 1,000 averages were used to calculate the
mean and standard deviation of the bootstrap, the latter of
which is an estimate of the standard error of the sample:

$$\bar{a}_b = \frac{1}{n_b} \sum_{i=1}^{n_b} a_{b,i}, \qquad s_b(a) = \sqrt{\frac{\sum_{i=1}^{n_b} (a_{b,i} - \bar{a}_b)^2}{n_b - 1}}$$

where $n_b$ is the number of bootstraps and equals
1,000 here.
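A minimal sketch of this bootstrap over SNVs is given below. The function names, the genotype encoding (minor allele count per site), the toy matrix, and the handling of replicates without any heterozygous site are assumptions for illustration only; in particular, singleton status is taken from the full sample rather than recomputed per replicate.

```python
import random
from statistics import mean, stdev


def mean_burden(genotypes, columns):
    """Average burden across individuals, using the listed SNV columns
    (columns may repeat, as in a bootstrap replicate)."""
    n_snvs = len(genotypes[0])
    totals = [sum(g[j] for g in genotypes if g[j] is not None) for j in range(n_snvs)]
    burdens = []
    for g in genotypes:
        het = [j for j in columns if g[j] == 1]
        if het:
            burdens.append(sum(1 for j in het if totals[j] == 1) / len(het))
    # Assumption: a replicate with no heterozygous site at all counts as 0.
    return mean(burdens) if burdens else 0.0


def bootstrap_se(genotypes, n_boot, rng):
    """Resample SNVs with replacement; return the bootstrap mean and its
    standard deviation (the standard error estimate)."""
    n_snvs = len(genotypes[0])
    averages = []
    for _ in range(n_boot):
        columns = [rng.randrange(n_snvs) for _ in range(n_snvs)]
        averages.append(mean_burden(genotypes, columns))
    return mean(averages), stdev(averages)


# Toy data (hypothetical): 3 individuals, 4 SNVs.
toy = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
boot_mean, boot_se = bootstrap_se(toy, 1000, random.Random(0))
print(boot_mean, boot_se)
```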